Pdfplumber: Integration #12949
ennamarie19 has previously contributed to projects/pdfplumber. The previous PR was #12567.
Thanks, @ennamarie19
Thank you, @ennamarie19. I'm the maintainer of pdfplumber. I've started receiving results of the fuzzing via email. Some look helpful, while others appeared to be triggered by problems with a core dependency, pdfminer.six. For example, this one: https://oss-fuzz.com/testcase-detail/5914823472250880
Is there a way to set up the fuzzer to ignore errors that originate with that dependency? That would help me focus on issues I can directly fix in pdfplumber.
Typically, we could add an exception handler in the fuzz harness for exceptions that we aren't interested in. Ideally, it would catch the base class of a library's custom exceptions (i.e., PSException).
However, in this case, that is a genuine bug in a library that you depend on.
I see two main options: we could ignore them using a catch that parses the exception traceback and filters for pdfminer, or add try/except blocks in the code when calling into pdfminer, to handle exceptions from sub-calls more robustly and prevent pdfplumber from crashing unexpectedly. Which do you prefer?
(Sent by email in reply to Jeremy Singer-Vine's comment above, dated Jan 29, 2025, 9:11 PM.)
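The first option (filtering on the traceback) could look roughly like the sketch below. It is only an illustration, not the harness used in this PR: the helper name `raised_inside` is hypothetical, and it assumes that "originates with the dependency" means the innermost traceback frame belongs to a module in the named package.

```python
from types import TracebackType
from typing import Optional


def raised_inside(exc: BaseException, package: str) -> bool:
    """Return True if the frame that raised `exc` lives in `package`.

    Hypothetical helper: walks to the innermost traceback frame and
    compares that frame's module name against the package name.
    """
    tb: Optional[TracebackType] = exc.__traceback__
    if tb is None:
        return False
    # Follow the chain to the frame where the exception was raised.
    while tb.tb_next is not None:
        tb = tb.tb_next
    module_name = tb.tb_frame.f_globals.get("__name__", "")
    return module_name == package or module_name.startswith(package + ".")
```

A harness could then call `raised_inside(e, "pdfminer")` inside a broad `except Exception as e:` block and re-raise only when it returns False. The trade-off is that genuine bugs in the dependency are silenced rather than reported.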
Catch exceptions from pdfminer and malformed PDFs ... thanks to OSS-Fuzz and @ennamarie19. Cf.: google/oss-fuzz#12949
Thanks, @ennamarie19. I've pushed a commit that handles exceptions stemming from pdfminer.six. Would it make sense to have the fuzzer then ignore these particular exceptions?:
from pdfplumber.utils.exceptions import MalformedPDFException, PdfminerException
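As a rough sketch of the second option discussed earlier (wrapping calls into the dependency so that its failures surface as a library-owned exception type), something like the following could be used. This is an assumption-laden illustration, not pdfplumber's actual code: the `PdfminerException` class here is a local stand-in defined only so the example runs on its own, and `call_dependency` is a hypothetical helper.

```python
from typing import Any, Callable


class PdfminerException(Exception):
    """Local stand-in for a library-owned wrapper exception (illustrative only)."""


def call_dependency(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call into the dependency, re-raising any failure as PdfminerException.

    Hypothetical helper: the original exception is kept as __cause__ so the
    underlying bug can still be diagnosed.
    """
    try:
        return func(*args, **kwargs)
    except Exception as e:
        raise PdfminerException(str(e)) from e
```

With a wrapper like this, callers and fuzz harnesses can catch one well-defined exception type instead of guessing which errors the dependency might raise.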
@jsvine Good idea! I've updated the scripts to catch and ignore those exceptions. However, I was fuzzing the main branch. I will open a PR here to fuzz the develop branch instead.
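The "catch and ignore" approach can be sketched as follows. This is not the script from this PR; `make_fuzz_target` and its parameters are hypothetical names, and the parser is passed in as a plain callable so the example runs without pdfplumber or a fuzzing engine installed.

```python
from typing import Callable, Tuple, Type


def make_fuzz_target(
    parse: Callable[[bytes], object],
    ignored: Tuple[Type[BaseException], ...],
) -> Callable[[bytes], None]:
    """Build a fuzz entry point that treats `ignored` exceptions as expected.

    Any other exception propagates, so the fuzzing engine still records it
    as a crash.
    """

    def test_one_input(data: bytes) -> None:
        try:
            parse(data)
        except ignored:
            # Expected failure on malformed input: not a finding.
            return

    return test_one_input
```

For pdfplumber, `ignored` would presumably be `(MalformedPDFException, PdfminerException)`, imported from `pdfplumber.utils.exceptions` as quoted above, with `parse` opening the fuzzer-provided bytes as a PDF.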
Great, thanks!
commit 0dd4925
Author: Jeremy Singer-Vine
Date: Thu Jun 12 07:31:46 2025 -0400
    Update CITATION.cff

commit c6a24be
Author: Jeremy Singer-Vine
Date: Thu Jun 12 07:23:30 2025 -0400
    Bump version to 0.11.7

commit 51f3065
Author: Jeremy Singer-Vine
Date: Thu Jun 12 07:21:29 2025 -0400
    Update CHANGELOG.md

commit 738f6f0
Author: Jeremy Singer-Vine
Date: Wed Jun 11 23:40:50 2025 -0400
    Add test for CLI auto-help

commit b88907f
Author: mara004
Date: Fri May 2 23:07:05 2025 +0200
    Minor cleanup around pypdfium2 integration

commit 7e364e6
Author: Jeremy Singer-Vine
Date: Wed Jun 11 22:24:28 2025 -0400
    Add Page.trimbox, .bleedbox, .artbox (jsvine#1313)
    Thanks to @samuelbradshaw for the suggestion!

commit 4c7e092
Author: Jeremy Singer-Vine
Date: Fri May 16 08:20:30 2025 -0400
    Upgrade pdfminer.six from 20250327 to 20250506
    ... and adjust color handling accordingly.

commit 3e0d4df
Author: Jeremy Singer-Vine
Date: Wed Jun 11 23:26:09 2025 -0400
    Run make format

commit cd6fd70
Author: nobody
Date: Mon May 19 08:31:53 2025 -0400
    Auto-add --help if CLI run w/o args
    (Commit message edited by @jsvine.)

commit 02ff431
Author: Jeremy Singer-Vine
Date: Thu Mar 27 23:21:17 2025 -0400
    Tiny tweaks to CHANGELOG.md

commit 8cd8e48
Author: Jeremy Singer-Vine
Date: Thu Mar 27 23:15:41 2025 -0400
    Bump version to 0.11.6

commit 44b078c
Author: Jeremy Singer-Vine
Date: Thu Mar 27 23:15:06 2025 -0400
    Update CHANGELOG.md

commit e15ed98
Author: Jeremy Singer-Vine
Date: Thu Mar 27 22:44:25 2025 -0400
    Fix bug w/ use_text_flow=True extractions (jsvine#1279)
    ... related to flows where text bounces between lines.
    h/t @samuelbradshaw

commit f2ad942
Author: Jeremy Singer-Vine
Date: Thu Mar 27 22:00:14 2025 -0400
    Add another oss-fuzz test case, already fixed

commit 748ff31
Author: Jeremy Singer-Vine
Date: Thu Mar 27 21:58:17 2025 -0400
    More broadly handle RecursionError, via oss-fuzz

commit 9148810
Author: Jeremy Singer-Vine
Date: Thu Mar 27 21:57:21 2025 -0400
    Fix unhandled None in do_PDFStream, via oss-fuzz

commit 3fcb493
Author: Jeremy Singer-Vine
Date: Thu Mar 27 21:31:06 2025 -0400
    Bump pdfminer.six to version 20250327

commit 7e28e76
Author: Jeremy Singer-Vine
Date: Tue Mar 25 23:03:13 2025 -0400
    Remove test_issue_1089 (jsvine#1263)
    @booxter makes a good point that the test is platform-specific. The issue has been resolved, and it's not expected to return, so I think provisionally OK to remove this test.

commit 630f30e
Author: Jeremy Singer-Vine
Date: Tue Mar 25 22:52:47 2025 -0400
    pragma:nocover exceptions no longer raised by pdfminer.six

commit 12a73a2
Author: Jeremy Singer-Vine
Date: Tue Mar 25 22:52:16 2025 -0400
    Bump pdfminer.six to version 20250324

commit 6349adb
Author: Jeremy Singer-Vine
Date: Mon Feb 10 22:09:28 2025 -0500
    Add escapechar for .to_csv(...)

commit 980494a
Author: Jeremy Singer-Vine
Date: Mon Feb 10 21:54:10 2025 -0500
    Use csv.QUOTE_MINIMAL for .to_csv(...)

commit 47a7ab8
Author: Jeremy Singer-Vine
Date: Mon Feb 10 21:53:17 2025 -0500
    Update exception handler

commit 8f5f498
Author: Jeremy Singer-Vine
Date: Sun Feb 9 17:23:37 2025 -0500
    Fix wrong exception expectation in test

commit 43ccc5b
Author: Jeremy Singer-Vine
Date: Sun Feb 9 16:23:57 2025 -0500
    Catch exceptions from pdfminer and malformed PDFs
    ... thanks to OSS-Fuzz and @ennamarie19
    Cf.: google/oss-fuzz#12949

commit a77808a
Merge: c562774 5d47d5a
Author: Jeremy Singer-Vine
Date: Sun Feb 2 11:16:58 2025 -0500
    Merge pull request jsvine#1270 from mara004/patch-1
    test_issue_1089: update wording regarding pypdfium2

commit 5d47d5a
Author: mara004
Date: Sun Feb 2 16:27:53 2025 +0100
    test_issue_1089: update wording regarding pypdfium2
    See jsvine#1089 (comment) for background

commit c562774
Author: Jeremy Singer-Vine
Date: Wed Jan 1 10:21:18 2025 -0500
    Bump version to 0.11.5

commit 4af0e1d
Author: Jeremy Singer-Vine
Date: Wed Jan 1 10:21:00 2025 -0500
    Update CHANGELOG.md

commit 7c63541
Author: Jeremy Singer-Vine
Date: Wed Jan 1 10:26:04 2025 -0500
    Add thanks to @stolarczyk in README.md

commit 078df97
Author: Jeremy Singer-Vine
Date: Tue Dec 31 09:11:32 2024 -0500
    Fix jsvine#1237 (tf → table_settings)
    h/t @n-traore
    And thanks to @cmdlineluser for the nudge.

commit 6e54799
Author: Jeremy Singer-Vine
Date: Sat Dec 28 12:13:32 2024 -0500
    Add thanks to @brandonrobertz (jsvine#1235)

commit 69d010a
Author: Jeremy Singer-Vine
Date: Sun Dec 15 23:24:31 2024 -0500
    Add initial test/docs for `format --text` (jsvine#1235)

commit e0ee254
Merge: 28d4f50 f3f2b57
Author: Jeremy Singer-Vine
Date: Sun Dec 15 23:07:14 2024 -0500
    Merge pull request jsvine#1235 from brandonrobertz/add-text-output-mode
    Add a --format text option

commit f3f2b57
Author: Brandon Roberts
Date: Tue Dec 10 14:21:22 2024 -0800
    Add a --format text option
    I use this regularly because pdfplumber has among the best layout preserving methods for PDFs, especially machine generated ones. Exposing the page output via CLI lets me use pdfplumber as a general purpose PDF-to-text tool.
    Usage: pdfplumber --format text file.pdf > file.txt

commit 28d4f50
Merge: ea3b3e5 2073164
Author: Jeremy Singer-Vine
Date: Sun Dec 8 23:10:15 2024 -0500
    Merge PR jsvine#1195

commit 2073164
Author: Jeremy Singer-Vine
Date: Sun Dec 8 22:55:30 2024 -0500
    Appease linter

commit c80c78d
Author: Michal Stolarczyk
Date: Fri Nov 22 16:48:19 2024 +0100
    add a test to cover raise_unicode_errors parameter

commit 1e4b48a
Author: Jeremy Singer-Vine
Date: Fri Nov 22 08:18:11 2024 -0500
    Run 'make format' and ignore code line-length

commit 138abab
Author: Michal Stolarczyk
Date: Wed Nov 13 18:34:35 2024 +0100
    rename warn_unicode_error to raise_unicode_errors for clarity
    additionally change the default accordingly

commit ea3b3e5
Merge: 6ef62c9 8542adb
Author: Jeremy Singer-Vine
Date: Sun Nov 10 22:47:33 2024 -0500
    Merge pull request jsvine#1221 from erghelium/develop
    Fix broken link to Anssi Nurminen's master's thesis in the README.md

commit 8542adb
Author: Guilherme
Date: Sun Nov 10 18:19:04 2024 -0300
    Fix broken link to Anssi Nurminen's master's thesis in README

commit 6ef62c9
Author: Jeremy Singer-Vine
Date: Wed Oct 2 21:11:38 2024 -0400
    Add `name` property to `image` objects (jsvine#1201)
    h/t @djr2015

commit 396c5e3
Author: Michal Stolarczyk
Date: Fri Aug 30 10:24:39 2024 +0200
    warn on unicode decoding errors in PDF annotations
    in some cases the the annotations may contain some junk that hinders annotations processing altogether. This allows to ignore the error and warn instead, which is configurable via warn_unicode_error arguments in the PDF initializer and/or open() method.
This pull request integrates the Dockerfile needed to build the fuzzers for pdfplumber.
Note: The fuzzers were NOT merged upstream, following discussion with the project maintainer here and the precedent for out-of-repo fuzzers established here.